Adrian Marino

Claudio Collado

Initialization

We fix the random seed so the experiments can be reproduced, and set the number of CPU cores to use.

set.seed(42)
options(mc.cores = 24)
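
A commonly paired setting (not in the original) tells rstan to cache compiled models on disk, so repeated stan() calls skip C++ recompilation:

```r
# Cache compiled Stan models so unchanged models are not recompiled
rstan::rstan_options(auto_write = TRUE)
```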

Libraries

We import the libraries used throughout the notebook:

# install.packages("pacman")
# install.packages("https://cran.r-project.org/src/contrib/rstan_2.21.2.tar.gz",repos = NULL,type="source")
# sudo apt-get install libglpk-dev
library(pacman)
p_load(tidyverse, tidymodels, rsample, rstan, shinystan, rstanarm, devtools)
p_load_gh('adrianmarino/commons')

import('../src/dataset.R')
[1] "-> '../src/dataset.R' script loadded successfuly!"
import('../src/plot.R')
[1] "-> '../src/plot.R' script loadded successfuly!"
import('../src/model.R')
[1] "-> './bayesian_regression_predictor.R' script loadded successfuly!"
[1] "-> '../src/model.R' script loadded successfuly!"

Dataset and Exploratory Analysis

Palmer Penguins

1. Reading the dataset

dataset <- load_dataset() %>% mutate_if(is.character, as.factor)
Rows: 344 Columns: 8
── Column specification ──────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset %>% glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6…
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, …
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200…
$ sex               <fct> male, female, female, NA, female, male, female, male, NA, NA, NA, NA, fema…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…

2. Variables

Each variable in the dataset is listed and briefly described below:

Numeric variables:

  • bill_length_mm: Bill length of the individual, in millimeters (also known as culmen length).
  • bill_depth_mm: Bill depth of the individual, in millimeters (also known as culmen depth).
  • flipper_length_mm: Flipper length of the individual, in millimeters.
  • body_mass_g: Body mass of the individual, in grams.
  • year: Year in which the individual was recorded.

Categorical variables:

  • species: Species of the individual (Adelie, Gentoo, or Chinstrap).
  • sex: Sex of the individual.
  • island: Island where the individual was found (Biscoe, Dream, or Torgersen).

Next, let's look at the possible values of the categorical variables:

show_values(dataset %>% select_if(negate(is.numeric)))
_____________ 
species     n 
============= 
Adelie    152 
Chinstrap  68 
Gentoo    124 
¯¯¯¯¯¯¯¯¯¯¯¯¯ 
_____________ 
island      n 
============= 
Biscoe    168 
Dream     124 
Torgersen  52 
¯¯¯¯¯¯¯¯¯¯¯¯¯ 
__________ 
sex      n 
========== 
female 165 
male   168 
NA      11 
¯¯¯¯¯¯¯¯¯¯ 

3. Missing-value summary

missings_summary(dataset)

4. Numeric variables

hist_plots(dataset)

Observations

  • Roughly the same number of individuals was recorded each year.
  • The body-mass distribution is positively skewed: there are many individuals with low body mass and fewer individuals with higher values.
  • Flipper length appears bimodal, with one mode around 192 mm and another around 215 mm.
  • Bill length also appears slightly positively skewed; that is, lighter individuals tend to have smaller bills.
  • Bill depth likewise shows a slight positive skew.

box_plots(dataset)

Observations

  • TO COMPLETE

Outliers

For each variable, no values fall outside the lower (inf) and upper (sup) bounds shown below; that is, we find no outliers.

outliers(dataset, column='bill_length_mm')
$inf
[1] 32.1

$sup
[1] 59.6

outliers(dataset, column='bill_depth_mm')
$inf
[1] 13.1

$sup
[1] 21.5

outliers(dataset, column='flipper_length_mm')
$inf
[1] 172

$sup
[1] 231

outliers(dataset, column='body_mass_g')
$inf
[1] 2700

$sup
[1] 6300

outliers(dataset, column='year')
$inf
[1] 2007

$sup
[1] 2009
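
The outliers() helper comes from ../src/dataset.R. A hedged sketch of what it might compute, assuming Tukey-style fences (1.5 × IQR beyond the quartiles), with inf/sup being the most extreme values still inside the fences:

```r
# Hypothetical re-implementation of the outliers() helper (assumption:
# Tukey fences; the reported bounds are the whisker ends, i.e. the most
# extreme observations not flagged as outliers).
whisker_bounds <- function(df, column) {
  x   <- df[[column]]
  qs  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- qs[[2]] - qs[[1]]
  inside <- x[!is.na(x) & x >= qs[[1]] - 1.5 * iqr & x <= qs[[2]] + 1.5 * iqr]
  list(inf = min(inside), sup = max(inside))
}

whisker_bounds(dataset, 'bill_length_mm')
```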

bar_plots(dataset)

Observations

  • The sex variable is balanced; on the other hand, it contains some missing values.
  • The island variable is quite unbalanced. This is most likely due either to a difference in population size across the islands or to a sampling bias, i.e., more individuals were recorded on some islands than on others.
  • The same happens with the species: Chinstrap is clearly underrepresented relative to the other species, while Adelie and Gentoo have closer counts.

5. Drop observations with missing values

dataset <- dataset %>% drop_na()
missings_summary(dataset)
Warning: attributes are not identical across measure variables;
they will be dropped

6. Correlations

corr_plot(dataset %>% dplyr::select(-year))

segmented_pairs_plot(dataset, segment_column='species')
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Experiments

Experiment 1

  • Numeric variables only.
  • Frequentist multiple regression.
  • Bayesian multiple regression with normal and exponential priors.

1. Split train vs. test

train_test <- train_test_split(dataset, train_size = 0.7, shuffle = TRUE)
[1] "Train set size: 233"
[1] "Test set size: 100"
train_set <- train_test[[1]]
test_set  <- train_test[[2]]
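
train_test_split() is a helper from ../src/dataset.R. An equivalent split could be obtained with rsample, which was already loaded via p_load (a sketch, not the authors' code):

```r
# Equivalent 70/30 split using rsample's standard API
split     <- rsample::initial_split(dataset, prop = 0.7)
train_alt <- rsample::training(split)
test_alt  <- rsample::testing(split)
```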

2. Linear model

lineal_model_1 <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set
)

3. Bayesian model

bayesion_model_1 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x1;
      vector<lower=1>[obs_count] x2;
      vector<lower=1>[obs_count] x3;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      beta3 ~ normal(0, 100);
      sigma ~ exponential(0.1);
    
      y ~ normal(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set),
      y  = colvalues(train_set, 'body_mass_g'),
      x1 = colvalues(train_set, 'bill_length_mm'),
      x2 = colvalues(train_set, 'bill_depth_mm'),
      x3 = colvalues(train_set, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)
[Stan sampling progress output: 3 chains, 300 iterations each (180 warmup); each chain finished in roughly 1-2 seconds.]
params_1 <- c('beta0', 'beta1', 'beta2', 'beta3', 'sigma')
traceplot(bayesion_model_1, pars = params_1, inc_warmup = TRUE)

4. Coefficients

lm_vs_br_coeficients(lineal_model_1, bayesion_model_1, params_1)

5. Validation

vars_1 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_1, params_1, vars_1, test_set)
bayesion_predictor_1 <- BayesianRegressionPredictor.from(bayesion_model_1, params_1, vars_1)

plot_compare_fit(
  lineal_model_1, 
  bayesion_predictor_1, 
  train_set,
  label_1='Linear Regression', 
  label_2='Bayesian Regression'
)

Experiment 2

  • Same as experiment 1, adding a categorical variable.
  • Frequentist multiple regression.
  • Bayesian regression with normal and exponential priors.

1. Linear model

lineal_model_2 <- lm(
    body_mass_g 
      ~ bill_length_mm
      + bill_depth_mm
      + flipper_length_mm
      + sex,

  data = train_set
)

2. Bayesian model

First of all, we transform the categorical column to a one-hot encoding:

cat_cols <- c('sex')

train_set_cat <- train_set %>% dummify(cat_cols)
test_set_cat  <- test_set  %>% dummify(cat_cols)

train_set_cat
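
dummify() is a helper loaded from the commons package. A comparable one-hot encoding of sex can be obtained in base R (a hypothetical equivalent; the column names differ from dummify's sex_female/sex_male):

```r
# One-hot encode the sex factor; dropping the intercept (- 1) makes
# model.matrix emit one indicator column per factor level
one_hot_sex <- model.matrix(~ sex - 1, data = train_set)
head(one_hot_sex)  # columns: sexfemale, sexmale
```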

We build a matrix with all the predictor variables (X) plus the intercept:

to_model_input <- function(df) {
  model.matrix(
    body_mass_g 
      ~ bill_length_mm
      + bill_depth_mm
      + flipper_length_mm
      + sex_female
      + sex_male,

    data = df
  )
}

train_X <- train_set_cat %>% to_model_input()
test_X  <- test_set_cat %>% to_model_input()

We define and run the Bayesian model:

bayesion_model_2 <- stan(
  model_code = "
    data {
      int<lower=1>                 obs_count;
      int<lower=1>                 coef_count;
      matrix[obs_count,coef_count] X;
      vector[obs_count]            y;
    }
    parameters {
      vector[coef_count]  beta;
      real<lower=0>       sigma;
    }
    model {
      beta[1] ~ normal(0, 2000);
      
      beta[2] ~ normal(0, 30);
      beta[3] ~ normal(0, 100);
      beta[4] ~ normal(0, 100);
  
      beta[5] ~ normal(0, 100);
      beta[6] ~ normal(0, 100);
  
      sigma ~ exponential(0.1);
  
      y ~ normal(X * beta, sigma);
    }
  ",
  data = list(
      obs_count  = dim(train_X)[1],
      coef_count = dim(train_X)[2],
      y          = colvalues(train_set_cat, 'body_mass_g'),
      X          = train_X
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)
[Stan sampling progress output: 3 chains, 300 iterations each (180 warmup); each chain finished in under a second.]
params_2 <- c('beta[1]', 'beta[2]', 'beta[3]', 'beta[4]', 'beta[5]', 'beta[6]', 'sigma')
traceplot(bayesion_model_2, inc_warmup = TRUE, pars = params_2)

3. Coefficients

lineal_model_2$coefficients
      (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm           sexmale 
    -1922.6152066        -0.5330962       -92.9216684        37.2016333       534.1583252 
br_coeficients(bayesion_model_2, params_2)

4. Validation

vars_2 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'sex_female','sex_male')

lm_vs_br_models_validation(
  lineal_model_2, 
  bayesion_model_2, 
  params_2,
  vars_2,
  test_set, 
  test_set_cat
)
bayesion_predictor_2 <- BayesianRegressionPredictor.from(bayesion_model_2, params_2, vars_2)

plot_compare_fit(
  lineal_model_2, 
  bayesion_predictor_2, 
  test_set, 
  test_set_cat,
  label_1='Linear Regression', 
  label_2='Bayesian Regression'
)

Experiment 3

  • Same as experiment 1, adding outliers to one of the numeric variables.
  • Frequentist multiple regression.
  • Bayesian regression with normal and exponential priors.

1. Outliers

Next we add outliers to the flipper_length_mm variable, which measures the individual's flipper length in millimeters.

To visualize the new outliers, we plot flipper_length_mm vs. body_mass_g before and after adding them:

plot_data(train_set)

train_set_with_outliers <- train_set %>%
  mutate(flipper_length_mm = ifelse(
    body_mass_g > 4900 & body_mass_g < 5000, 
    flipper_length_mm + (flipper_length_mm * runif(1, 0.1, 0.2)), 
    flipper_length_mm
  ))

plot_data(train_set_with_outliers)

2. Linear model

lineal_model_3 <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set_with_outliers
)

Let's compare the fit of the linear model when trained on a train set with and without outliers:

plot_compare_fit(
  lineal_model_1, 
  lineal_model_3, 
  train_set_with_outliers,
  label_1='Linear Regression WITHOUT outliers', 
  label_2='Linear Regression WITH outliers'
)

Note how the model trained on the train set with outliers is pulled toward the atypical observations.

3. Robust linear model

We fit a robust multiple linear regression to try to reduce the effect of the new outliers.

p_load(MASS)
robust_lineal_model_3 <- rlm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set_with_outliers
)
plot_compare_fit(
  lineal_model_1, 
  robust_lineal_model_3, 
  train_set_with_outliers,
  label_1='Linear Regression WITHOUT outliers', 
  label_2='Robust Linear Regression WITH outliers'
)

Graphically the difference is hard to see, but the robust model ends up fitting better than the classical linear model. This can be observed by comparing RMSE/MAE on the training set.
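
lm_vs_lm_models_validation() is a helper from ../src/model.R. The RMSE/MAE comparison it performs can be sketched as follows (the metrics() helper below is hypothetical, not from the source):

```r
# Compute RMSE and MAE of a fitted model on a given data frame
metrics <- function(model, df) {
  pred   <- predict(model, newdata = df)
  actual <- df$body_mass_g
  c(rmse = sqrt(mean((actual - pred)^2)),
    mae  = mean(abs(actual - pred)))
}

rbind(classic = metrics(lineal_model_3,        train_set_with_outliers),
      robust  = metrics(robust_lineal_model_3, train_set_with_outliers))
```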

lm_vs_lm_models_validation(lineal_model_3, robust_lineal_model_3, train_set_with_outliers)
Warning in actual - predicted :
  longer object length is not a multiple of shorter object length

Even so, the classical model still gives better results on test:

lm_vs_lm_models_validation(lineal_model_3, robust_lineal_model_3, test_set)

4. Bayesian model with a normal likelihood

bayesion_model_3 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x1;
      vector<lower=1>[obs_count] x2;
      vector<lower=1>[obs_count] x3;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      beta3 ~ normal(0, 100);
      sigma ~ exponential(0.1);
    
      y ~ normal(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_with_outliers),
      y  = colvalues(train_set_with_outliers, 'body_mass_g'),
      x1 = colvalues(train_set_with_outliers, 'bill_length_mm'),
      x2 = colvalues(train_set_with_outliers, 'bill_depth_mm'),
      x3 = colvalues(train_set_with_outliers, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)
[Stan sampling progress output: 3 chains, 300 iterations each (180 warmup); each chain finished in roughly 1-2 seconds.]
params_3 <- c('beta0', 'beta1', 'beta2', 'beta3', 'sigma')
traceplot(bayesion_model_3, pars = params_3, inc_warmup = TRUE)

5. Coefficients

lm_vs_br_coeficients(robust_lineal_model_3, bayesion_model_3, params_3)

6. Validation

vars_3 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm') 

lm_vs_br_models_validation(robust_lineal_model_3, bayesion_model_3, params_3, vars_3, test_set)
bayesion_predictor_3 <- BayesianRegressionPredictor.from(bayesion_model_3, params_3, vars_3)

plot_compare_fit(
  robust_lineal_model_3, 
  bayesion_predictor_3, 
  train_set,
  label_1='Robust Linear Regression WITH outliers', 
  label_2='Bayesian Regression WITH outliers'
)

7. Bayesian model with a Student-t likelihood
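
This section has no body in the source. As a hedged sketch (not the authors' code), the natural robust variant replaces the normal likelihood of the model above with a Student-t, adding a degrees-of-freedom parameter nu; the gamma(2, 0.1) prior on nu is a common weakly informative choice and an assumption here:

```r
# Hypothetical sketch: experiment 3's model with a Student-t likelihood,
# whose heavy tails down-weight outliers. The prior on nu is an
# assumption, not taken from the source.
robust_model_code <- "
  data {
    int<lower=1>               obs_count;
    vector<lower=1>[obs_count] x1;
    vector<lower=1>[obs_count] x2;
    vector<lower=1>[obs_count] x3;
    vector[obs_count]          y;
  }
  parameters {
    real          beta0;
    real          beta1;
    real          beta2;
    real          beta3;
    real<lower=0> sigma;
    real<lower=1> nu;     // degrees of freedom of the t likelihood
  }
  model {
    beta0 ~ normal(0, 8000);
    beta1 ~ normal(0, 100);
    beta2 ~ normal(0, 100);
    beta3 ~ normal(0, 100);
    sigma ~ exponential(0.1);
    nu    ~ gamma(2, 0.1);

    y ~ student_t(nu, beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
  }
"
```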

Experiment 4

  • Same as experiment 1, but reducing the number of observations to just a few (e.g., 30).
  • Frequentist multiple regression.
  • Bayesian regression with normal and exponential priors.

1. Split train - test

In this case we train with only 5% of the data.

train_test <- train_test_split(dataset, train_size = 0.05, shuffle = TRUE)
[1] "Train set size: 16"
[1] "Test set size: 317"
train_set_4 <- train_test[[1]]
test_set_4  <- train_test[[2]]
plot_data(train_set_4)

2. Linear model

lineal_model_4 <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set_4
)

3. Bayesian model

bayesion_model_4 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x1;
      vector<lower=1>[obs_count] x2;
      vector<lower=1>[obs_count] x3;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      beta3 ~ normal(0, 100);
      sigma ~ exponential(0.1);
    
      y ~ normal(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_4),
      y  = colvalues(train_set_4, 'body_mass_g'),
      x1 = colvalues(train_set_4, 'bill_length_mm'),
      x2 = colvalues(train_set_4, 'bill_depth_mm'),
      x3 = colvalues(train_set_4, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)
[Stan sampling progress output: 3 chains, 300 iterations each (180 warmup); each chain finished in about 0.2 seconds.]
params_4 <- c('beta0', 'beta1', 'beta2', 'beta3', 'sigma')
traceplot(bayesion_model_4, pars = params_4, inc_warmup = TRUE)

4. Coefficients

Coefficients of the frequentist multiple regression:

lineal_model_4$coefficients
      (Intercept)    bill_length_mm     bill_depth_mm flipper_length_mm 
      -7582.81982          11.48403          75.08405          50.08308 

Posterior-mean coefficients found by the Bayesian multiple regression:

for(param in params_4) print(get_posterior_mean(bayesion_model_4, par=param)[4])
[1] -6615.826
[1] 9.556223
[1] 56.53997
[1] 47.18177
[1] 260.8093

5. Validation

vars_4 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_4, bayesion_model_4, params_4, vars_4, test_set_4)
bayesion_predictor_4 <- BayesianRegressionPredictor.from(bayesion_model_4, params_4, vars_4)

plot_compare_fit(
  lineal_model_4,
  bayesion_predictor_4, 
  train_set_4,
  label_1='Linear Regression', 
  label_2='Bayesian Regression'
)

Experiment 5

  • Same as experiment 1, but proposing two new Bayesian regressions whose priors on the parameters are:
    • Weakly informative (uniform).
    • Highly informative (biased, or with very low variance).
  • Compare against the results of the Bayesian model from experiment 1.

1. Modelo bayesiano con parametro con distribucion poco informativa

1. Model

We define a uniform distribution for the beta associated with the flipper_length_mm variable.

starting worker pid=16677 on localhost:11189 at 17:23:47.241
starting worker pid=16714 on localhost:11189 at 17:23:47.352
starting worker pid=16751 on localhost:11189 at 17:23:47.459

SAMPLING FOR MODEL '8d4f6f44cd52d0fbdc2d079b71ab6e36' NOW (CHAINS 1-3).
Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: exponential_lpdf: Random variable is -0.453992, but must be >= 0!  (in 'model3ad83d61fbd8_8d4f6f44cd52d0fbdc2d079b71ab6e36' at line 20)

[Interleaved warmup/sampling progress of the 3 chains (1000 iterations each) trimmed]

Chain 1:  Elapsed Time: 0.654554 seconds (Warm-up), 2.31789 seconds (Sampling), 2.97244 seconds (Total)
Chain 2:  Elapsed Time: 0.890608 seconds (Warm-up), 3.09286 seconds (Sampling), 3.98347 seconds (Total)
Chain 3:  Elapsed Time: 1.14506 seconds (Warm-up), 4.13633 seconds (Sampling), 5.28139 seconds (Total)

2. Coefficients

br_vs_br_coeficients(bayesion_model_1, bayesion_model_5, params_5)

3. Validation of the experiment 1 model (baseline)

lm_vs_br_models_validation(lineal_model_1, bayesion_model_1, params_1, vars_1, test_set)

4. Validation of the model with the uniform prior

vars_5 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_5, params_5, vars_5, test_set)
bayesion_predictor_5 <- BayesianRegressionPredictor.from(bayesion_model_5, params_5, vars_5)

plot_compare_fit(
  bayesion_predictor_1,
  bayesion_predictor_5,
  train_set,
  label_1='Regresion Bayesiana con dist informativa', 
  label_2='Regresion Bayesiana con dist menos informativa'
)

2. Bayesian model with a highly informative prior (biased or with very low variance)

TODO

---
title: "Enfoque Estadístico del Aprendizaje - Trabajo Final - Regresión Bayesiana"
fig_width: 3 
fig_height: 3 
output:
  html_document:
    highlight: pygments
    theme: sandstone
    toc: yes
    df_print: paged
    includes:
      before_body: ./header.html
  html_notebook: 
    toc: yes
    toc_float: yes
    df_print: paged
---

### Adrian Marino

### Claudio Collado


# Initialization

We set the seed so the experiments can be reproduced, and fix the number of CPU cores to use.

```{r}
set.seed(42)
options(mc.cores = 24)
```


# Libraries

We import the libraries used throughout the notebook:

```{r message=FALSE, warning=FALSE}
# install.packages("pacman")
# install.packages("https://cran.r-project.org/src/contrib/rstan_2.21.2.tar.gz",repos = NULL,type="source")
# sudo apt-get install libglpk-dev
```

```{r message=FALSE, warning=FALSE}
library(pacman)
p_load(tidyverse, tidymodels, rsample, rstan, shinystan, rstanarm, devtools)
p_load_gh('adrianmarino/commons')

import('../src/dataset.R')
import('../src/plot.R')
import('../src/model.R')
```

# Dataset and Exploratory Analysis

![Palmer Penguins](https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/lter_penguins.png)

[Palmer Penguins](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-28/readme.md)


## 1. Loading the dataset


```{r message=FALSE, warning=FALSE}
dataset <- load_dataset() %>% mutate_if(is.character, as.factor)

dataset %>% glimpse()
```


## 2. Variables

Each variable in the dataset is listed and briefly described below:

Numeric variables:

* **bill_length_mm**: Bill length of the individual, in millimeters (also known as culmen length).
* **bill_depth_mm**: Bill depth of the individual, in millimeters (also known as culmen depth).
* **flipper_length_mm**: Flipper length of the individual, in millimeters.
* **body_mass_g**: Body mass of the individual, in grams.
* **year**: Year in which the individual was recorded.

Categorical variables:

* **species**: Species of the individual (Adelie, Gentoo, or Chinstrap).
* **sex**: Sex of the individual.
* **island**: Island where the individual was found (Biscoe, Dream, or Torgersen).

Next, let's look at the possible values of the categorical variables:

```{r}
show_values(dataset %>% select_if(negate(is.numeric)))
```

## 3. Missing-value summary

```{r message=FALSE,warning=FALSE}
missings_summary(dataset)
```

## 4. Numeric variables

```{r fig.height=2, fig.width=3, message=TRUE, warning=FALSE}
hist_plots(dataset)
```


**Observations**

* Roughly the same number of individuals was recorded each year.
* The distribution of body mass is positively skewed: many individuals have low body-mass values, and progressively fewer have higher values.
* Flipper length appears to be bimodal, with one mode near 192 mm and another near 215 mm.
* Bill length also seems to have a slight positive skew.
* Bill depth likewise seems to have a slight positive skew.


```{r fig.align='center', fig.height=5, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
box_plots(dataset)
```


**Observations**

* TODO


### Outliers

No values beyond the boxplot whiskers are observed in any variable; that is, we find no outliers.
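
The `outliers()` helper used below comes from the authors' `commons` package. Assuming it applies the usual boxplot whisker rule, an equivalent check can be sketched in base R (the function name and the 1.5·IQR rule are assumptions, not taken from the original helper):

```{r}
# Assumed equivalent of the outliers() helper: values more than 1.5 * IQR
# away from the quartiles (the standard boxplot whisker rule).
iqr_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x[!is.na(x) & (x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)]
}
```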

```{r, fig.show='hide'}
outliers(dataset, column='bill_length_mm')
```



```{r, fig.show='hide'}
outliers(dataset, column='bill_depth_mm')
```

```{r, fig.show='hide'}
outliers(dataset, column='flipper_length_mm')
```

```{r, fig.show='hide'}
outliers(dataset, column='body_mass_g')
```


```{r, fig.show='hide'}
outliers(dataset, column='year')
```


```{r fig.height=3, fig.width=3, warning=FALSE}
bar_plots(dataset)
```

**Observations**

* The sex variable is balanced, although it contains some missing values.
* The island variable is strongly imbalanced. This is probably due to differences in population size between the islands, or to a sampling bias: more individuals were recorded on some islands than on others.
* The same happens with the species: there is a large imbalance between Chinstrap and the other species, while Adelie and Gentoo have closer counts.


## 5. Dropping observations with missing values

```{r}
dataset <- dataset %>% drop_na()
missings_summary(dataset)
```

## 6. Correlations

```{r fig.align='center', fig.height=5, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
corr_plot(dataset %>% dplyr::select(-year))
```

```{r fig.align='center', fig.height=12, fig.width=12, message=FALSE, warning=FALSE, fig.align='center'}
segmented_pairs_plot(dataset, segment_column='species')
```

# Experiments

## Experiment 1

* Numeric variables only.
* Frequentist multiple regression.
* Bayesian multiple regression with normal priors and an exponential prior on sigma.


### 1. Split train vs. test


```{r message=FALSE, warning=FALSE}
train_test <- train_test_split(dataset, train_size = 0.7, shuffle = TRUE)
train_set <- train_test[[1]]
test_set  <- train_test[[2]]
```


### 2. Linear model

```{r}
lineal_model_1 <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set
)
```

### 3. Bayesian model

```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_1 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x1;
      vector<lower=1>[obs_count] x2;
      vector<lower=1>[obs_count] x3;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      beta3 ~ normal(0, 100);
      sigma ~ exponential(0.1);
    
      y ~ normal(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set),
      y  = colvalues(train_set, 'body_mass_g'),
      x1 = colvalues(train_set, 'bill_length_mm'),
      x2 = colvalues(train_set, 'bill_depth_mm'),
      x3 = colvalues(train_set, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)

params_1 <- c('beta0', 'beta1', 'beta2', 'beta3', 'sigma')
traceplot(bayesion_model_1, pars = params_1, inc_warmup = TRUE)
```


### 4. Coefficients

```{r}
lm_vs_br_coeficients(lineal_model_1, bayesion_model_1, params_1)
```

### 5. Validation


```{r}
vars_1 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_1, params_1, vars_1, test_set)
```

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_1 <- BayesianRegressionPredictor.from(bayesion_model_1, params_1, vars_1)

plot_compare_fit(
  lineal_model_1, 
  bayesion_predictor_1, 
  train_set,
  label_1='Regresion Lineal', 
  label_2='Regresion Bayesiana'
)
```


## Experiment 2

* Same as experiment 1, adding a categorical variable.
* Frequentist multiple regression.
* Bayesian regression with normal priors and an exponential prior on sigma.


### 1. Linear model


```{r}
lineal_model_2 <- lm(
    body_mass_g 
      ~ bill_length_mm
      + bill_depth_mm
      + flipper_length_mm
      + sex,

  data = train_set
)
```

### 2. Bayesian model

First of all, we one-hot encode the categorical column:

```{r}
cat_cols <- c('sex')

train_set_cat <- train_set %>% dummify(cat_cols)
test_set_cat  <- test_set  %>% dummify(cat_cols)

train_set_cat
```
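
The `dummify` helper comes from the authors' `commons` package. Assuming it produces one 0/1 column per factor level, an equivalent base-R sketch could be:

```{r}
# Hypothetical stand-in for dummify(): one 0/1 column per factor level.
# (Assumed behavior; the real helper lives in the commons package.)
dummify_sketch <- function(df, cols) {
  for (col in cols) {
    for (level in levels(df[[col]])) {
      df[[paste(col, level, sep = '_')]] <- as.integer(df[[col]] == level)
    }
    df[[col]] <- NULL  # drop the original factor column
  }
  df
}
```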


We build a matrix with all the variables (X) plus the intercept. Note that keeping both sex dummies together with the intercept makes the columns linearly dependent; in the Bayesian model below, the priors on the betas are what keep the posterior identified.

```{r}
to_model_input <- function(df) {
  model.matrix(
    body_mass_g 
      ~ bill_length_mm
      + bill_depth_mm
      + flipper_length_mm
      + sex_female
      + sex_male,

    data = df
  )
}

train_X <- train_set_cat %>% to_model_input()
test_X  <- test_set_cat %>% to_model_input()
```


We define and run the Bayesian model:

```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_2 <- stan(
  model_code = "
    data {
      int<lower=1>                 obs_count;
      int<lower=1>                 coef_count;
      matrix[obs_count,coef_count] X;
      vector[obs_count]            y;
    }
    parameters {
      vector[coef_count]  beta;
      real<lower=0>       sigma;
    }
    model {
      beta[1] ~ normal(0, 2000);
      
      beta[2] ~ normal(0, 30);
      beta[3] ~ normal(0, 100);
      beta[4] ~ normal(0, 100);
  
      beta[5] ~ normal(0, 100);
      beta[6] ~ normal(0, 100);
  
      sigma ~ exponential(0.1);
  
      y ~ normal(X * beta, sigma);
    }
  ",
  data = list(
      obs_count  = dim(train_X)[1],
      coef_count = dim(train_X)[2],
      y          = colvalues(train_set_cat, 'body_mass_g'),
      X          = train_X
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)

params_2 <- c('beta[1]', 'beta[2]', 'beta[3]', 'beta[4]', 'beta[5]', 'beta[6]', 'sigma')
traceplot(bayesion_model_2, inc_warmup = TRUE, pars = params_2)
```


### 3. Coefficients

```{r}
lineal_model_2$coefficients
```

```{r}
br_coeficients(bayesion_model_2, params_2)
```

### 4. Validation

```{r}
vars_2 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'sex_female','sex_male')

lm_vs_br_models_validation(
  lineal_model_2, 
  bayesion_model_2, 
  params_2,
  vars_2,
  test_set, 
  test_set_cat
)
```

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_2 <- BayesianRegressionPredictor.from(bayesion_model_2, params_2, vars_2)

plot_compare_fit(
  lineal_model_2, 
  bayesion_predictor_2, 
  test_set, 
  test_set_cat,
  label_1='Regresion Lineal', 
  label_2='Regresion Bayesiana'
)
```

## Experiment 3

* Same as experiment 1, injecting outliers into one numeric variable.
* Frequentist multiple regression.
* Bayesian regression with normal priors and an exponential prior on sigma.


### 1. Outliers

Below we add outliers to the **flipper_length_mm** variable, which measures the flipper length of the individual in millimeters.

To visualize the new outliers, we plot **flipper_length_mm** vs. **body_mass_g** before and after adding them:

```{r fig.align='center', fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
plot_data(train_set)
```

```{r fig.align='center', fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
train_set_with_outliers <- train_set %>%
  mutate(flipper_length_mm = ifelse(
    body_mass_g > 4900 & body_mass_g < 5000, 
    flipper_length_mm + (flipper_length_mm * runif(1, 0.1, 0.2)), 
    flipper_length_mm
  ))

plot_data(train_set_with_outliers)
```

### 2. Linear model

```{r}
lineal_model_3 <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set_with_outliers
)
```


Let's compare the fit of the linear model trained on the train set with and without outliers:

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
plot_compare_fit(
  lineal_model_1, 
  lineal_model_3, 
  train_set_with_outliers,
  label_1='Regresión Lineal SIN outliers', 
  label_2='Regresión Lineal CON outliers'
)
```
 
The model trained on the train set with outliers is visibly pulled toward the atypical observations.


### 3. Robust linear model

We fit a robust multiple linear regression to try to reduce the effect of the new outliers.

```{r}
p_load(MASS)
robust_lineal_model_3 <- rlm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set_with_outliers
)
```


```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
plot_compare_fit(
  lineal_model_1, 
  robust_lineal_model_3, 
  train_set_with_outliers,
  label_1='Regresión Lineal SIN outliers', 
  label_2='Regresión Lineal Robusta CON outliers'
)
```

Graphically the difference is hard to see, but the robust model ends up fitting better than the classical linear model. This can be observed by comparing RMSE/MAE on the train set.


```{r}
lm_vs_lm_models_validation(lineal_model_3, robust_lineal_model_3, train_set_with_outliers)
```

Even so, the classical model still gives better results on the test set:

```{r}
lm_vs_lm_models_validation(lineal_model_3, robust_lineal_model_3, test_set)
```
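
For reference, `lm_vs_lm_models_validation` comes from the authors' `model.R`. Assuming it reports the usual RMSE and MAE, those metrics can be sketched in base R as:

```{r}
# Assumed metric definitions behind the validation helpers.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Example usage against any fitted lm, e.g.:
# rmse(test_set$body_mass_g, predict(lineal_model_3, test_set))
```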


### 4. Bayesian model with a normal likelihood


```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_3 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x1;
      vector<lower=1>[obs_count] x2;
      vector<lower=1>[obs_count] x3;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      beta3 ~ normal(0, 100);
      sigma ~ exponential(0.1);
    
      y ~ normal(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_with_outliers),
      y  = colvalues(train_set_with_outliers, 'body_mass_g'),
      x1 = colvalues(train_set_with_outliers, 'bill_length_mm'),
      x2 = colvalues(train_set_with_outliers, 'bill_depth_mm'),
      x3 = colvalues(train_set_with_outliers, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)

params_3 <- c('beta0', 'beta1', 'beta2', 'beta3', 'sigma')
traceplot(bayesion_model_3, pars = params_3, inc_warmup = TRUE)
```


### 5. Coefficients

```{r}
lm_vs_br_coeficients(robust_lineal_model_3, bayesion_model_3, params_3)
```

### 6. Validation


```{r}
vars_3 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm') 

lm_vs_br_models_validation(robust_lineal_model_3, bayesion_model_3, params_3, vars_3, test_set)
```

```{r fig.align='center', fig.height=3, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_3 <- BayesianRegressionPredictor.from(bayesion_model_3, params_3, vars_3)

plot_compare_fit(
  robust_lineal_model_3, 
  bayesion_predictor_3, 
  train_set,
  label_1='Regresion Lineal Robusta CON outliers', 
  label_2='Regresion Bayesiana CON outliers'
)
```


### 7. Bayesian model with a Student-t likelihood
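
This section was left pending. Under the assumption that it should reuse the experiment-3 data, a minimal sketch that swaps the normal likelihood for a heavier-tailed `student_t` could look as follows (the `gamma(2, 0.1)` prior on the degrees of freedom `nu` is an assumption, not part of the original):

```{r eval=FALSE}
robust_bayesian_sketch <- stan(
  model_code = "
    data {
      int<lower=1>      obs_count;
      vector[obs_count] x1;
      vector[obs_count] x2;
      vector[obs_count] x3;
      vector[obs_count] y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
      real<lower=1> nu;        // degrees of freedom of the t distribution
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      beta3 ~ normal(0, 100);
      sigma ~ exponential(0.1);
      nu    ~ gamma(2, 0.1);   // assumed prior: low nu => heavy tails

      // t likelihood: less sensitive to outliers than the normal
      y ~ student_t(nu, beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_with_outliers),
      y  = colvalues(train_set_with_outliers, 'body_mass_g'),
      x1 = colvalues(train_set_with_outliers, 'bill_length_mm'),
      x2 = colvalues(train_set_with_outliers, 'bill_depth_mm'),
      x3 = colvalues(train_set_with_outliers, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)
```

Lower values of `nu` give the likelihood heavier tails, so isolated outliers pull the fitted line less than under a normal likelihood; as `nu` grows, the t distribution approaches the normal.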


## Experiment 4

* Same as experiment 1, but reducing the number of observations to a small value (e.g., 30).
* Frequentist multiple regression.
* Bayesian regression with normal priors and an exponential prior on sigma.


### 1. Split train vs. test

In this case we train with only 5% of the data.

```{r message=FALSE, warning=FALSE}
train_test <- train_test_split(dataset, train_size = 0.05, shuffle = TRUE)
train_set_4 <- train_test[[1]]
test_set_4  <- train_test[[2]]
```

```{r fig.align='center', fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
plot_data(train_set_4)
```

### 2. Linear model

```{r}
lineal_model_4 <- lm(
  body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
  data = train_set_4
)
```

### 3. Bayesian model

```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_model_4 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x1;
      vector<lower=1>[obs_count] x2;
      vector<lower=1>[obs_count] x3;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      beta3 ~ normal(0, 100);
      sigma ~ exponential(0.1);
    
      y ~ normal(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set_4),
      y  = colvalues(train_set_4, 'body_mass_g'),
      x1 = colvalues(train_set_4, 'bill_length_mm'),
      x2 = colvalues(train_set_4, 'bill_depth_mm'),
      x3 = colvalues(train_set_4, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)

params_4 <- c('beta0', 'beta1', 'beta2', 'beta3', 'sigma')
traceplot(bayesion_model_4, pars = params_4, inc_warmup = TRUE)
```

### 4. Coefficients

Coefficients of the frequentist multiple regression:


```{r}
lineal_model_4$coefficients
```

Coefficients found by the Bayesian multiple regression:

```{r}
for(param in params_4) print(get_posterior_mean(bayesion_model_4, par=param)[4])
```


### 5. Validation


```{r}
vars_4 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_4, bayesion_model_4, params_4, vars_4, test_set_4)
```

```{r fig.align='center', fig.height=3, fig.width=5, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_4 <- BayesianRegressionPredictor.from(bayesion_model_4, params_4, vars_4)

plot_compare_fit(
  lineal_model_4,
  bayesion_predictor_4, 
  train_set_4,
  label_1='Regresion Lineal', 
  label_2='Regresion Bayesiana'
)
```


## Experiment 5

* Same as experiment 1, but proposing two new Bayesian regressions whose priors on the parameters are:
  * A weakly informative one (uniform).
  * A highly informative one (biased or with very low variance).
* Compare against the results of the Bayesian model from experiment 1.

### 1. Bayesian model with a weakly informative prior

#### 1. Model

We define a [uniform distribution](https://mc-stan.org/docs/2_21/functions-reference/uniform-distribution.html) for the beta associated with the **flipper_length_mm** variable.

```{r fig.align='center', fig.height=4, fig.width=8, message=FALSE, warning=FALSE, fig.align='center', echo=FALSE}
bayesion_model_5 <- stan(
  model_code =  "
    data {
      int<lower=1>               obs_count;
      vector<lower=1>[obs_count] x1;
      vector<lower=1>[obs_count] x2;
      vector<lower=1>[obs_count] x3;
      vector[obs_count]          y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      // beta3: no explicit prior, which in Stan means an implicit (improper)
      // uniform prior over the reals. (An exponential prior here would reject
      // initial values, since beta3 is unconstrained and may be negative.)
      sigma ~ exponential(0.5);
    
      y ~ normal(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set),
      y  = colvalues(train_set, 'body_mass_g'),
      x1 = colvalues(train_set, 'bill_length_mm'),
      x2 = colvalues(train_set, 'bill_depth_mm'),
      x3 = colvalues(train_set, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 1000,
  warmup = 180,
  thin   = 1
)

params_5 <- c('beta0', 'beta1', 'beta2', 'beta3', 'sigma')
traceplot(bayesion_model_5, pars = params_5, inc_warmup = TRUE)
```


#### 2. Coefficients


```{r}
br_vs_br_coeficients(bayesion_model_1, bayesion_model_5, params_5)
```


#### 3. Validation of the experiment 1 model (baseline)

```{r}
lm_vs_br_models_validation(lineal_model_1, bayesion_model_1, params_1, vars_1, test_set)
```

#### 4. Validation of the model with the uniform prior

```{r}
vars_5 <- c('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm') 

lm_vs_br_models_validation(lineal_model_1, bayesion_model_5, params_5, vars_5, test_set)
```

```{r fig.align='center', fig.height=3, fig.width=10, message=FALSE, warning=FALSE, fig.align='center'}
bayesion_predictor_5 <- BayesianRegressionPredictor.from(bayesion_model_5, params_5, vars_5)

plot_compare_fit(
  bayesion_predictor_1,
  bayesion_predictor_5,
  train_set,
  label_1='Regresion Bayesiana con dist informativa', 
  label_2='Regresion Bayesiana con dist menos informativa'
)
```

### 2. Bayesian model with a highly informative prior (biased or with very low variance)

TODO
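
As a starting point for this pending section, a minimal sketch (assumed, not from the original) could reuse the experiment-1 model but give `beta3` a biased, very-low-variance prior; the `normal(100, 1)` values below are illustrative assumptions:

```{r eval=FALSE}
# Sketch: same model as experiment 1, but with a highly informative prior
# on beta3 (the flipper_length_mm coefficient).
bayesion_model_6_sketch <- stan(
  model_code = "
    data {
      int<lower=1>      obs_count;
      vector[obs_count] x1;
      vector[obs_count] x2;
      vector[obs_count] x3;
      vector[obs_count] y;
    }
    parameters {
      real          beta0;
      real          beta1;
      real          beta2;
      real          beta3;
      real<lower=0> sigma;
    }
    model {
      beta0 ~ normal(0, 8000);
      beta1 ~ normal(0, 100);
      beta2 ~ normal(0, 100);
      beta3 ~ normal(100, 1);   // highly informative: biased and tight
      sigma ~ exponential(0.1);

      y ~ normal(beta0 + beta1 * x1 + beta2 * x2 + beta3 * x3, sigma);
    }
  ",
  data = list(
      obs_count = nrow(train_set),
      y  = colvalues(train_set, 'body_mass_g'),
      x1 = colvalues(train_set, 'bill_length_mm'),
      x2 = colvalues(train_set, 'bill_depth_mm'),
      x3 = colvalues(train_set, 'flipper_length_mm')
  ),
  chains = 3,
  iter   = 300,
  warmup = 180,
  thin   = 1
)
```

With such a tight prior, the posterior for `beta3` should stay close to the prior mean even where the data disagree, which is exactly the contrast this section sets up against experiment 1.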




## References

* [Making Predictions from Stan models in R](https://medium.com/@alex.pavlakis/making-predictions-from-stan-models-in-r-3e349dfac1ed)
* [How to represent a categorical predictor rstan?](https://stackoverflow.com/questions/29183577/how-to-represent-a-categorical-predictor-rstan)
* [R commons](https://github.com/adrianmarino/commons)

